In the following notebook I identify and analyze the relationships between different stocks in the S&P 500 index using network theory and statistical methods.
Find the top-n central and peripheral stocks based on network topological parameters (centrality measures).
Simulate the performance of the central and peripheral portfolios against new historical data for the S&P 500 in 2024.
Note: This is an open-source notebook that I developed for my personal research as a data scientist; it is not intended as investment recommendations or propositions.
# First, let's import the libraries
import tqdm
import requests
import numpy as np
import pandas as pd
import seaborn as sns
import networkx as nx
import plotly.express as px
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import pandas_datareader.data as web
import warnings
warnings.filterwarnings('ignore')
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Read the CSV data file using pandas
stock_data = pd.read_csv('/content/drive/SP500_price_data_2010_to_2023.csv', index_col=[0])
stock_data
| MMM | AOS | ABT | ABBV | ABMD | ACN | ATVI | ADM | ADBE | ADP | ... | WYNN | XEL | XYL | YUM | ZBRA | ZBH | ZION | ZTS | CEG | OGN | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | |||||||||||||||||||||
| 2010-01-04 | 69.414719 | 7.435000 | 26.129908 | NaN | 8.74 | 42.070000 | NaN | 31.469999 | 37.090000 | 37.603161 | ... | 63.959999 | 21.080000 | NaN | 25.226456 | 28.670000 | 58.271847 | 13.330000 | NaN | NaN | NaN |
| 2010-01-05 | 68.979935 | 7.340000 | 25.918797 | NaN | 8.53 | 42.330002 | NaN | 31.639999 | 37.700001 | 37.401230 | ... | 67.849998 | 20.830000 | NaN | 25.140186 | 28.620001 | 60.116505 | 13.800000 | NaN | NaN | NaN |
| 2010-01-06 | 69.958191 | 7.343333 | 26.062737 | NaN | 8.40 | 42.779999 | NaN | 31.559999 | 37.619999 | 37.313435 | ... | 66.959999 | 20.870001 | NaN | 24.960461 | 28.400000 | 60.097088 | 15.000000 | NaN | NaN | NaN |
| 2010-01-07 | 70.008362 | 7.365000 | 26.278646 | NaN | 8.40 | 42.740002 | NaN | 31.230000 | 36.889999 | 37.295876 | ... | 68.389999 | 20.780001 | NaN | 24.953272 | 27.690001 | 61.475727 | 16.680000 | NaN | NaN | NaN |
| 2010-01-08 | 70.501671 | 7.473333 | 26.412991 | NaN | 8.23 | 42.570000 | NaN | 30.840000 | 36.689999 | 37.243195 | ... | 67.900002 | 20.790001 | NaN | 24.960461 | 27.600000 | 60.184467 | 16.410000 | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2023-12-22 | 88.904678 | 81.730003 | 109.029999 | 154.940002 | NaN | 354.450012 | NaN | 71.459999 | 598.750000 | 230.970001 | ... | 90.419998 | 62.060001 | 112.639999 | 130.000000 | 269.410004 | 120.559998 | 44.000000 | 194.979996 | 115.330002 | 13.91 |
| 2023-12-26 | 90.392975 | 81.849998 | 109.230003 | 154.619995 | NaN | 353.429993 | NaN | 72.389999 | 598.260010 | 232.539993 | ... | 91.000000 | 62.250000 | 113.620003 | 130.259995 | 275.500000 | 121.419998 | 44.880001 | 195.500000 | 117.570000 | 14.13 |
| 2023-12-27 | 90.919731 | 82.160004 | 109.790001 | 154.880005 | NaN | 353.820007 | NaN | 72.169998 | 596.080017 | 231.610001 | ... | 90.629997 | 61.540001 | 114.379997 | 130.839996 | 275.790009 | 121.129997 | 44.509998 | 196.899994 | 117.419998 | 14.36 |
| 2023-12-28 | 91.714050 | 82.239998 | 110.400002 | 154.750000 | NaN | 351.589996 | NaN | 72.269997 | 595.520020 | 232.490005 | ... | 91.760002 | 61.889999 | 114.320000 | 130.520004 | 275.350006 | 121.629997 | 44.660000 | 197.160004 | 117.110001 | 14.48 |
| 2023-12-29 | 91.404678 | 82.440002 | 110.070000 | 154.970001 | NaN | 350.910004 | NaN | 72.220001 | 596.599976 | 232.970001 | ... | 91.110001 | 61.910000 | 114.360001 | 130.660004 | 273.329987 | 121.699997 | 43.869999 | 197.369995 | 116.889999 | 14.42 |
3522 rows × 504 columns
There are missing values because certain stocks may leave the S&P 500 and others may enter it during this period.
# Create a heatmap visualization to display missing values in the stock_data DataFrame
figure = plt.figure(figsize=(20, 10))
sns.heatmap(stock_data.T.isnull());
# Clean stock_data by removing columns that contain any missing values
stock_data_cleaned = stock_data.dropna(axis=1)
# Plot the heatmap
figure = plt.figure(figsize=(20, 10))
sns.heatmap(stock_data_cleaned.T.isnull());
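As a quick sanity check, `dropna(axis=1)` keeps only the columns with no missing values. A minimal sketch on a toy frame (the column names here are illustrative, not real tickers):

```python
import numpy as np
import pandas as pd

# Toy frame: column 'B' has a gap, mimicking a stock that left the index
toy = pd.DataFrame({
    'A': [1.0, 2.0, 3.0],
    'B': [1.0, np.nan, 3.0],
    'C': [4.0, 5.0, 6.0],
})

# dropna(axis=1) removes every column containing at least one NaN
cleaned = toy.dropna(axis=1)
print(list(cleaned.columns))  # ['A', 'C']
```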
Breaking the stock data down by year makes it easier to inspect and easier to understand how market conditions change over time.
# Define the function to get year-wise S&P 500 data
def get_year_wise_snp_500_data(data, year):
start_date = '{}-01-01'.format(year)
end_date = '{}-12-31'.format(year)
year_wise_data = data.loc[start_date:end_date]
return year_wise_data
# Create a separate variable for each year containing only that year's data
snp_500_2010 = get_year_wise_snp_500_data(stock_data_cleaned, 2010)
snp_500_2011 = get_year_wise_snp_500_data(stock_data_cleaned, 2011)
snp_500_2012 = get_year_wise_snp_500_data(stock_data_cleaned, 2012)
snp_500_2013 = get_year_wise_snp_500_data(stock_data_cleaned, 2013)
snp_500_2014 = get_year_wise_snp_500_data(stock_data_cleaned, 2014)
snp_500_2015 = get_year_wise_snp_500_data(stock_data_cleaned, 2015)
snp_500_2016 = get_year_wise_snp_500_data(stock_data_cleaned, 2016)
snp_500_2017 = get_year_wise_snp_500_data(stock_data_cleaned, 2017)
snp_500_2018 = get_year_wise_snp_500_data(stock_data_cleaned, 2018)
snp_500_2019 = get_year_wise_snp_500_data(stock_data_cleaned, 2019)
snp_500_2020 = get_year_wise_snp_500_data(stock_data_cleaned, 2020)
snp_500_2021 = get_year_wise_snp_500_data(stock_data_cleaned, 2021)
snp_500_2022 = get_year_wise_snp_500_data(stock_data_cleaned, 2022)
snp_500_2023 = get_year_wise_snp_500_data(stock_data_cleaned, 2023)
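The fourteen assignments above can also be written as a single dictionary comprehension. A sketch of the equivalent pattern, using a toy one-column frame in place of `stock_data_cleaned`:

```python
import pandas as pd

def get_year_wise_snp_500_data(data, year):
    # Slice the date-indexed frame to a single calendar year
    return data.loc['{}-01-01'.format(year):'{}-12-31'.format(year)]

# Toy daily frame standing in for stock_data_cleaned
idx = pd.date_range('2010-01-01', '2011-12-31', freq='D')
toy = pd.DataFrame({'AAA': range(len(idx))}, index=idx)

# One dict instead of fourteen separate variables
snp_by_year = {year: get_year_wise_snp_500_data(toy, year) for year in range(2010, 2012)}
print(len(snp_by_year[2010]))  # 365 daily rows in 2010
```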
#looking at the matrix for 2010
snp_500_2010
| MMM | AOS | ABT | ACN | ADM | ADBE | ADP | AAP | AES | AFL | ... | WY | WHR | WMB | WTW | WYNN | XEL | YUM | ZBRA | ZBH | ZION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | |||||||||||||||||||||
| 2010-01-04 | 69.414719 | 7.435000 | 26.129908 | 42.070000 | 31.469999 | 37.090000 | 37.603161 | 40.380001 | 13.67 | 23.785000 | ... | 44.279999 | 81.949997 | 17.616472 | 70.913910 | 63.959999 | 21.080000 | 25.226456 | 28.670000 | 58.271847 | 13.330000 |
| 2010-01-05 | 68.979935 | 7.340000 | 25.918797 | 42.330002 | 31.639999 | 37.700001 | 37.401230 | 40.139999 | 13.53 | 24.475000 | ... | 45.220001 | 81.779999 | 17.836576 | 70.754967 | 67.849998 | 20.830000 | 25.140186 | 28.620001 | 60.116505 | 13.800000 |
| 2010-01-06 | 69.958191 | 7.343333 | 26.062737 | 42.779999 | 31.559999 | 37.619999 | 37.313435 | 40.490002 | 13.39 | 24.690001 | ... | 44.720001 | 81.059998 | 18.415367 | 71.894043 | 66.959999 | 20.870001 | 24.960461 | 28.400000 | 60.097088 | 15.000000 |
| 2010-01-07 | 70.008362 | 7.365000 | 26.278646 | 42.740002 | 31.230000 | 36.889999 | 37.295876 | 40.480000 | 13.40 | 24.955000 | ... | 44.520000 | 82.800003 | 18.284937 | 71.682121 | 68.389999 | 20.780001 | 24.953272 | 27.690001 | 61.475727 | 16.680000 |
| 2010-01-08 | 70.501671 | 7.473333 | 26.412991 | 42.570000 | 30.840000 | 36.689999 | 37.243195 | 40.639999 | 13.78 | 24.705000 | ... | 44.110001 | 84.559998 | 18.431671 | 71.602646 | 67.900002 | 20.790001 | 24.960461 | 27.600000 | 60.184467 | 16.410000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2010-12-27 | 72.750839 | 9.722500 | 22.732924 | 48.250000 | 30.090000 | 30.950001 | 40.886742 | 66.419998 | 12.17 | 28.485001 | ... | 18.709999 | 90.330002 | 20.078377 | 91.072845 | 102.120003 | 23.459999 | 35.600288 | 38.419998 | 52.864079 | 24.100000 |
| 2010-12-28 | 72.525085 | 9.637500 | 22.737722 | 47.980000 | 29.780001 | 30.920000 | 40.851624 | 66.250000 | 12.09 | 28.445000 | ... | 18.840000 | 90.309998 | 20.070225 | 91.682121 | 101.690002 | 23.520000 | 35.391804 | 38.250000 | 52.582523 | 24.459999 |
| 2010-12-29 | 72.541809 | 9.680000 | 22.752115 | 48.619999 | 29.750000 | 30.969999 | 40.798946 | 66.449997 | 12.15 | 28.350000 | ... | 19.000000 | 90.000000 | 20.176201 | 91.629143 | 102.610001 | 23.500000 | 35.657799 | 38.439999 | 52.398060 | 24.270000 |
| 2010-12-30 | 72.357857 | 9.650000 | 22.824085 | 48.419998 | 29.900000 | 30.570000 | 40.763828 | 66.720001 | 12.20 | 28.045000 | ... | 18.950001 | 89.820000 | 20.102833 | 91.814568 | 101.709999 | 23.549999 | 35.449318 | 38.299999 | 52.330097 | 24.270000 |
| 2010-12-31 | 72.157188 | 9.520000 | 22.987219 | 48.490002 | 30.080000 | 30.780001 | 40.632133 | 66.150002 | 12.18 | 28.215000 | ... | 18.930000 | 88.830002 | 20.151745 | 91.735100 | 103.839996 | 23.549999 | 35.262402 | 37.990002 | 52.116505 | 24.230000 |
252 rows × 420 columns
#looking at the matrix for 2010 shifted
snp_500_2010.shift(1)
| MMM | AOS | ABT | ACN | ADM | ADBE | ADP | AAP | AES | AFL | ... | WY | WHR | WMB | WTW | WYNN | XEL | YUM | ZBRA | ZBH | ZION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | |||||||||||||||||||||
| 2010-01-04 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2010-01-05 | 69.414719 | 7.435000 | 26.129908 | 42.070000 | 31.469999 | 37.090000 | 37.603161 | 40.380001 | 13.67 | 23.785000 | ... | 44.279999 | 81.949997 | 17.616472 | 70.913910 | 63.959999 | 21.080000 | 25.226456 | 28.670000 | 58.271847 | 13.330000 |
| 2010-01-06 | 68.979935 | 7.340000 | 25.918797 | 42.330002 | 31.639999 | 37.700001 | 37.401230 | 40.139999 | 13.53 | 24.475000 | ... | 45.220001 | 81.779999 | 17.836576 | 70.754967 | 67.849998 | 20.830000 | 25.140186 | 28.620001 | 60.116505 | 13.800000 |
| 2010-01-07 | 69.958191 | 7.343333 | 26.062737 | 42.779999 | 31.559999 | 37.619999 | 37.313435 | 40.490002 | 13.39 | 24.690001 | ... | 44.720001 | 81.059998 | 18.415367 | 71.894043 | 66.959999 | 20.870001 | 24.960461 | 28.400000 | 60.097088 | 15.000000 |
| 2010-01-08 | 70.008362 | 7.365000 | 26.278646 | 42.740002 | 31.230000 | 36.889999 | 37.295876 | 40.480000 | 13.40 | 24.955000 | ... | 44.520000 | 82.800003 | 18.284937 | 71.682121 | 68.389999 | 20.780001 | 24.953272 | 27.690001 | 61.475727 | 16.680000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2010-12-27 | 72.299332 | 9.717500 | 22.939238 | 48.340000 | 30.209999 | 30.850000 | 40.842846 | 66.570000 | 12.23 | 28.555000 | ... | 18.559999 | 89.860001 | 20.070225 | 91.390732 | 103.000000 | 23.469999 | 35.700935 | 38.490002 | 52.834949 | 23.740000 |
| 2010-12-28 | 72.750839 | 9.722500 | 22.732924 | 48.250000 | 30.090000 | 30.950001 | 40.886742 | 66.419998 | 12.17 | 28.485001 | ... | 18.709999 | 90.330002 | 20.078377 | 91.072845 | 102.120003 | 23.459999 | 35.600288 | 38.419998 | 52.864079 | 24.100000 |
| 2010-12-29 | 72.525085 | 9.637500 | 22.737722 | 47.980000 | 29.780001 | 30.920000 | 40.851624 | 66.250000 | 12.09 | 28.445000 | ... | 18.840000 | 90.309998 | 20.070225 | 91.682121 | 101.690002 | 23.520000 | 35.391804 | 38.250000 | 52.582523 | 24.459999 |
| 2010-12-30 | 72.541809 | 9.680000 | 22.752115 | 48.619999 | 29.750000 | 30.969999 | 40.798946 | 66.449997 | 12.15 | 28.350000 | ... | 19.000000 | 90.000000 | 20.176201 | 91.629143 | 102.610001 | 23.500000 | 35.657799 | 38.439999 | 52.398060 | 24.270000 |
| 2010-12-31 | 72.357857 | 9.650000 | 22.824085 | 48.419998 | 29.900000 | 30.570000 | 40.763828 | 66.720001 | 12.20 | 28.045000 | ... | 18.950001 | 89.820000 | 20.102833 | 91.814568 | 101.709999 | 23.549999 | 35.449318 | 38.299999 | 52.330097 | 24.270000 |
252 rows × 420 columns
Above is the same 2010 data shifted by one day, which will be used to calculate daily returns: subtract the previous day's price from the current day's price, then divide by the previous day's price.
$$ \text{Return}_t = \frac{\text{Price}_t - \text{Price}_{t-1}}{\text{Price}_{t-1}} $$
For this project I will use log returns, since they are more likely to be normally distributed than simple returns, which are more affected by extreme values that can distort the analysis.
$$ \text{Log Return}_t = \ln(\text{Price}_t) - \ln(\text{Price}_{t-1}) = \ln\left(\frac{\text{Price}_t}{\text{Price}_{t-1}}\right) $$# Calculating daily log returns by differencing consecutive days with the help of the shift function. Note: the expression below subtracts in the opposite order to the formula above, which flips the sign of every return but leaves the pairwise correlations used later unchanged.
log_returns_2010 = np.log(snp_500_2010.shift(1)) - np.log(snp_500_2010)
log_returns_2011 = np.log(snp_500_2011.shift(1)) - np.log(snp_500_2011)
log_returns_2012 = np.log(snp_500_2012.shift(1)) - np.log(snp_500_2012)
log_returns_2013 = np.log(snp_500_2013.shift(1)) - np.log(snp_500_2013)
log_returns_2014 = np.log(snp_500_2014.shift(1)) - np.log(snp_500_2014)
log_returns_2015 = np.log(snp_500_2015.shift(1)) - np.log(snp_500_2015)
log_returns_2016 = np.log(snp_500_2016.shift(1)) - np.log(snp_500_2016)
log_returns_2017 = np.log(snp_500_2017.shift(1)) - np.log(snp_500_2017)
log_returns_2018 = np.log(snp_500_2018.shift(1)) - np.log(snp_500_2018)
log_returns_2019 = np.log(snp_500_2019.shift(1)) - np.log(snp_500_2019)
log_returns_2020 = np.log(snp_500_2020.shift(1)) - np.log(snp_500_2020)
log_returns_2021 = np.log(snp_500_2021.shift(1)) - np.log(snp_500_2021)
log_returns_2022 = np.log(snp_500_2022.shift(1)) - np.log(snp_500_2022)
log_returns_2023 = np.log(snp_500_2023.shift(1)) - np.log(snp_500_2023)
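A small check of the log-return identity on an illustrative toy price series, showing that the difference of logs equals the log of the price ratio, and that swapping the subtraction order simply negates each return:

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 102.0, 99.0, 101.0])

# Conventional log return: ln(P_t) - ln(P_{t-1}) = ln(P_t / P_{t-1})
log_ret = np.log(prices) - np.log(prices.shift(1))
assert np.allclose(log_ret.dropna(), np.log(prices / prices.shift(1)).dropna())

# Subtracting in the reverse order yields the sign-flipped series
flipped = np.log(prices.shift(1)) - np.log(prices)
print(np.allclose(flipped.dropna(), -log_ret.dropna()))  # True
```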
# Visualise the Log returns for the year 2010
log_returns_2010
| MMM | AOS | ABT | ACN | ADM | ADBE | ADP | AAP | AES | AFL | ... | WY | WHR | WMB | WTW | WYNN | XEL | YUM | ZBRA | ZBH | ZION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | |||||||||||||||||||||
| 2010-01-04 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2010-01-05 | 0.006283 | 0.012860 | 0.008112 | -0.006161 | -0.005387 | -0.016313 | 0.005385 | 0.005961 | 0.010294 | -0.028597 | ... | -0.021006 | 0.002077 | -0.012417 | 0.002244 | -0.059041 | 0.011930 | 0.003426 | 0.001745 | -0.031165 | -0.034651 |
| 2010-01-06 | -0.014082 | -0.000454 | -0.005538 | -0.010575 | 0.002532 | 0.002124 | 0.002350 | -0.008682 | 0.010401 | -0.008746 | ... | 0.011119 | 0.008843 | -0.031934 | -0.015971 | 0.013204 | -0.001919 | 0.007175 | 0.007717 | 0.000323 | -0.083382 |
| 2010-01-07 | -0.000717 | -0.002946 | -0.008250 | 0.000935 | 0.010511 | 0.019595 | 0.000471 | 0.000247 | -0.000746 | -0.010676 | ... | 0.004482 | -0.021239 | 0.007108 | 0.002952 | -0.021131 | 0.004322 | 0.000288 | 0.025318 | -0.022681 | -0.106160 |
| 2010-01-08 | -0.007022 | -0.014602 | -0.005099 | 0.003986 | 0.012567 | 0.005436 | 0.001414 | -0.003945 | -0.027964 | 0.010069 | ... | 0.009252 | -0.021033 | -0.007993 | 0.001109 | 0.007191 | -0.000481 | -0.000288 | 0.003256 | 0.021228 | 0.016320 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2010-12-27 | -0.006226 | -0.000514 | 0.009035 | 0.001864 | 0.003980 | -0.003236 | -0.001074 | 0.002256 | 0.004918 | 0.002454 | ... | -0.008049 | -0.005217 | -0.000406 | 0.003484 | 0.008580 | 0.000426 | 0.002823 | 0.001820 | -0.000551 | -0.015050 |
| 2010-12-28 | 0.003108 | 0.008781 | -0.000211 | 0.005612 | 0.010356 | 0.000970 | 0.000859 | 0.002563 | 0.006595 | 0.001405 | ... | -0.006924 | 0.000221 | 0.000406 | -0.006668 | 0.004220 | -0.002554 | 0.005873 | 0.004435 | 0.005340 | -0.014827 |
| 2010-12-29 | -0.000231 | -0.004400 | -0.000633 | -0.013251 | 0.001008 | -0.001616 | 0.001290 | -0.003014 | -0.004950 | 0.003345 | ... | -0.008457 | 0.003438 | -0.005266 | 0.000578 | -0.009006 | 0.000851 | -0.007488 | -0.004955 | 0.003514 | 0.007798 |
| 2010-12-30 | 0.002539 | 0.003104 | -0.003158 | 0.004122 | -0.005029 | 0.013000 | 0.000861 | -0.004055 | -0.004107 | 0.010817 | ... | 0.002635 | 0.002002 | 0.003643 | -0.002022 | 0.008810 | -0.002125 | 0.005864 | 0.003649 | 0.001298 | 0.000000 |
| 2010-12-31 | 0.002777 | 0.013563 | -0.007122 | -0.001445 | -0.006002 | -0.006846 | 0.003236 | 0.008580 | 0.001641 | -0.006043 | ... | 0.001056 | 0.011083 | -0.002430 | 0.000866 | -0.020726 | 0.000000 | 0.005287 | 0.008127 | 0.004090 | 0.001650 |
252 rows × 420 columns
The code below computes the Pearson correlation coefficients for each year separately (2010-2023). Each correlation matrix represents the pairwise relationships between the stocks for that specific year.
# Computing the correlation (adjacency) matrices:
return_correlation_2010 = log_returns_2010.corr()
return_correlation_2011 = log_returns_2011.corr()
return_correlation_2012 = log_returns_2012.corr()
return_correlation_2013 = log_returns_2013.corr()
return_correlation_2014 = log_returns_2014.corr()
return_correlation_2015 = log_returns_2015.corr()
return_correlation_2016 = log_returns_2016.corr()
return_correlation_2017 = log_returns_2017.corr()
return_correlation_2018 = log_returns_2018.corr()
return_correlation_2019 = log_returns_2019.corr()
return_correlation_2020 = log_returns_2020.corr()
return_correlation_2021 = log_returns_2021.corr()
return_correlation_2022 = log_returns_2022.corr()
return_correlation_2023 = log_returns_2023.corr()
# Visualise the correlation matrix for the year 2010
return_correlation_2010
| MMM | AOS | ABT | ACN | ADM | ADBE | ADP | AAP | AES | AFL | ... | WY | WHR | WMB | WTW | WYNN | XEL | YUM | ZBRA | ZBH | ZION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MMM | 1.000000 | 0.593484 | 0.483927 | 0.486274 | 0.432142 | 0.334670 | 0.676420 | 0.253205 | 0.583607 | 0.686239 | ... | 0.136096 | 0.622998 | 0.621104 | 0.524046 | 0.541869 | 0.593247 | 0.588807 | 0.525908 | 0.569811 | 0.485266 |
| AOS | 0.593484 | 1.000000 | 0.399469 | 0.346007 | 0.323026 | 0.305744 | 0.527425 | 0.288655 | 0.524470 | 0.591544 | ... | 0.191281 | 0.529962 | 0.558584 | 0.439802 | 0.457564 | 0.467970 | 0.509763 | 0.526935 | 0.388246 | 0.455234 |
| ABT | 0.483927 | 0.399469 | 1.000000 | 0.445754 | 0.442076 | 0.290392 | 0.495372 | 0.156493 | 0.487559 | 0.532127 | ... | 0.121300 | 0.450807 | 0.499627 | 0.415275 | 0.403185 | 0.560456 | 0.451432 | 0.411371 | 0.401872 | 0.416013 |
| ACN | 0.486274 | 0.346007 | 0.445754 | 1.000000 | 0.360045 | 0.255378 | 0.537841 | 0.183239 | 0.457032 | 0.521398 | ... | 0.177261 | 0.472019 | 0.454769 | 0.499239 | 0.442191 | 0.430872 | 0.510435 | 0.438993 | 0.428124 | 0.394382 |
| ADM | 0.432142 | 0.323026 | 0.442076 | 0.360045 | 1.000000 | 0.210313 | 0.506797 | 0.284246 | 0.428716 | 0.489294 | ... | 0.061361 | 0.387251 | 0.458206 | 0.406431 | 0.350003 | 0.480415 | 0.508323 | 0.442256 | 0.344112 | 0.308510 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| XEL | 0.593247 | 0.467970 | 0.560456 | 0.430872 | 0.480415 | 0.357840 | 0.650324 | 0.281257 | 0.613973 | 0.641188 | ... | 0.143083 | 0.515557 | 0.609604 | 0.462807 | 0.488771 | 1.000000 | 0.603322 | 0.524355 | 0.453095 | 0.436663 |
| YUM | 0.588807 | 0.509763 | 0.451432 | 0.510435 | 0.508323 | 0.345217 | 0.695521 | 0.423742 | 0.611902 | 0.703039 | ... | 0.083662 | 0.557554 | 0.590081 | 0.480745 | 0.632180 | 0.603322 | 1.000000 | 0.573824 | 0.438971 | 0.489627 |
| ZBRA | 0.525908 | 0.526935 | 0.411371 | 0.438993 | 0.442256 | 0.386711 | 0.598238 | 0.338250 | 0.536078 | 0.586118 | ... | 0.100419 | 0.516997 | 0.587588 | 0.430601 | 0.499715 | 0.524355 | 0.573824 | 1.000000 | 0.469623 | 0.435329 |
| ZBH | 0.569811 | 0.388246 | 0.401872 | 0.428124 | 0.344112 | 0.297990 | 0.558904 | 0.219691 | 0.423741 | 0.528562 | ... | 0.193002 | 0.510361 | 0.465346 | 0.455785 | 0.451811 | 0.453095 | 0.438971 | 0.469623 | 1.000000 | 0.387305 |
| ZION | 0.485266 | 0.455234 | 0.416013 | 0.394382 | 0.308510 | 0.374793 | 0.505663 | 0.290643 | 0.528292 | 0.618436 | ... | 0.350160 | 0.538962 | 0.516113 | 0.417525 | 0.468152 | 0.436663 | 0.489627 | 0.435329 | 0.387305 | 1.000000 |
420 rows × 420 columns
The correlation matrix for the year 2010 above shows the correlation coefficients between our variables, which in our case are the log returns of the 420 stocks. The correlation coefficient, denoted corr, measures the linear relationship between two variables. It ranges from -1 to 1:
corr = 1 indicates a perfect positive correlation.
corr = −1 indicates a perfect negative correlation.
corr = 0 indicates no linear correlation.
The matrix is symmetric, with diagonal elements equal to 1 (since each stock is perfectly correlated with itself).
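These properties can be checked directly on a small correlation matrix built from random returns (the column names 'A', 'B', 'C' here are placeholders, not real tickers):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.normal(size=(100, 3)), columns=['A', 'B', 'C'])

corr = returns.corr()  # Pearson by default

# Symmetric, unit diagonal, all values in [-1, 1]
print(np.allclose(corr, corr.T))                 # True
print(np.allclose(np.diag(corr), 1.0))           # True
print(((corr >= -1) & (corr <= 1)).all().all())  # True
```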
figure, axes = plt.subplots(7, 2, figsize=(30, 30))
# Adding the heatmaps and setting titles with the corresponding year
sns.heatmap(return_correlation_2010, ax=axes[0, 0])
axes[0, 0].set_title('2010')
sns.heatmap(return_correlation_2011, ax=axes[0, 1])
axes[0, 1].set_title('2011')
sns.heatmap(return_correlation_2012, ax=axes[1, 0])
axes[1, 0].set_title('2012')
sns.heatmap(return_correlation_2013, ax=axes[1, 1])
axes[1, 1].set_title('2013')
sns.heatmap(return_correlation_2014, ax=axes[2, 0])
axes[2, 0].set_title('2014')
sns.heatmap(return_correlation_2015, ax=axes[2, 1])
axes[2, 1].set_title('2015')
sns.heatmap(return_correlation_2016, ax=axes[3, 0])
axes[3, 0].set_title('2016')
sns.heatmap(return_correlation_2017, ax=axes[3, 1])
axes[3, 1].set_title('2017')
sns.heatmap(return_correlation_2018, ax=axes[4, 0])
axes[4, 0].set_title('2018')
sns.heatmap(return_correlation_2019, ax=axes[4, 1])
axes[4, 1].set_title('2019')
sns.heatmap(return_correlation_2020, ax=axes[5, 0])
axes[5, 0].set_title('2020')
sns.heatmap(return_correlation_2021, ax=axes[5, 1])
axes[5, 1].set_title('2021')
sns.heatmap(return_correlation_2022, ax=axes[6, 0])
axes[6, 0].set_title('2022')
sns.heatmap(return_correlation_2023, ax=axes[6, 1])
axes[6, 1].set_title('2023')
# Adjusting layout for better display
plt.tight_layout()
# Display the plot
plt.show()
The first plot for 2010, and especially the second plot for 2011, shows high correlation among the stocks. In 2011 there was a market downturn and high volatility, so stock prices fell together with one another, which explains the high correlations.
Similarly, in 2012, 2014 and 2017 the market was stabilizing, and hence the correlations among stocks are low.
In 2020, due to the COVID pandemic and the resulting market volatility, stock prices moved down or up together, which is why we again observe high correlation.
From this we can see that in stable market conditions the correlation matrices show low values, whereas in critical market conditions they show high values.
Based on the above observations, I am going to use networks (which in the code I sometimes refer to as graphs, from graph theory) to uncover more specific patterns and their implications for S&P 500 market conditions.
In the approach below we visualize the relationships (correlations) between the stocks in 2010 as a graph, where nodes represent stocks and edges represent correlations between them.
graph_2010 = nx.Graph(return_correlation_2010)
#Plot the Graph
figure = plt.figure(figsize=(24, 12))
nx.draw_networkx(graph_2010, with_labels=False,node_size=30,width=0.01)
plt.title('Network of S&P 500 Stock Market (2010)', fontsize=15)
plt.show()
A fully connected network has been created from the correlation matrix, meaning that every stock is connected to all the other stocks, including self-loops. The plot also shows that such a network is almost impossible to read and understand.
The stock market is a large network in which each stock is connected to every other stock, creating a very complex picture.
To simplify it I will use a technique called the Minimum Spanning Tree (MST). It is like drawing a map of the market while keeping only the most important roads between stocks, which helps us focus on the key relationships and understand the market better.
Note: while the MST removes redundancy and noise from the data, it may lose some information as well.
Before constructing the MST graph, each correlation coefficient is converted into a distance. The transformation is a mathematical way to map the correlation coefficient to a measure that behaves like a distance in Euclidean space. $$\text{distance} = \sqrt{2\times(1 -\text{correlation})}$$
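A quick sketch of how this transformation behaves at the extremes of the correlation range, so that strongly correlated stocks end up close together and anti-correlated stocks far apart:

```python
import numpy as np

def corr_to_distance(corr):
    # d = sqrt(2 * (1 - corr)); monotonically decreasing in corr
    return np.sqrt(2 * (1 - corr))

print(corr_to_distance(1.0))   # 0.0  (perfect correlation -> zero distance)
print(corr_to_distance(0.0))   # ~1.414
print(corr_to_distance(-1.0))  # 2.0  (perfect anti-correlation -> max distance)
```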
# Convert the correlation coefficients into distances
distance_2010 = np.sqrt(2 * (1 - return_correlation_2010))
distance_2011 = np.sqrt(2 * (1 - return_correlation_2011))
distance_2012 = np.sqrt(2 * (1 - return_correlation_2012))
distance_2013 = np.sqrt(2 * (1 - return_correlation_2013))
distance_2014 = np.sqrt(2 * (1 - return_correlation_2014))
distance_2015 = np.sqrt(2 * (1 - return_correlation_2015))
distance_2016 = np.sqrt(2 * (1 - return_correlation_2016))
distance_2017 = np.sqrt(2 * (1 - return_correlation_2017))
distance_2018 = np.sqrt(2 * (1 - return_correlation_2018))
distance_2019 = np.sqrt(2 * (1 - return_correlation_2019))
distance_2020 = np.sqrt(2 * (1 - return_correlation_2020))
distance_2021 = np.sqrt(2 * (1 - return_correlation_2021))
distance_2022 = np.sqrt(2 * (1 - return_correlation_2022))
distance_2023 = np.sqrt(2 * (1 - return_correlation_2023))
distance_2010.head(5)
| MMM | AOS | ABT | ACN | ADM | ADBE | ADP | AAP | AES | AFL | ... | WY | WHR | WMB | WTW | WYNN | XEL | YUM | ZBRA | ZBH | ZION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MMM | 0.000000 | 0.901683 | 1.015946 | 1.013633 | 1.065700 | 1.153542 | 0.804463 | 1.222126 | 0.912571 | 0.792162 | ... | 1.314461 | 0.868334 | 0.870513 | 0.975658 | 0.957215 | 0.901946 | 0.906855 | 0.973747 | 0.927566 | 1.014627 |
| AOS | 0.901683 | 0.000000 | 1.095930 | 1.143672 | 1.163593 | 1.178351 | 0.972188 | 1.192765 | 0.975223 | 0.903831 | ... | 1.271785 | 0.969575 | 0.939591 | 1.058488 | 1.041572 | 1.031533 | 0.990189 | 0.972692 | 1.106123 | 1.043806 |
| ABT | 1.015946 | 1.095930 | 0.000000 | 1.052850 | 1.056337 | 1.191308 | 1.004617 | 1.298851 | 1.012364 | 0.967339 | ... | 1.325669 | 1.048039 | 1.000373 | 1.081411 | 1.092534 | 0.937597 | 1.047442 | 1.085015 | 1.093735 | 1.080728 |
| ACN | 1.013633 | 1.143672 | 1.052850 | 0.000000 | 1.131331 | 1.220346 | 0.961415 | 1.278093 | 1.042082 | 0.978368 | ... | 1.282762 | 1.027600 | 1.044252 | 1.000761 | 1.056228 | 1.066891 | 0.989510 | 1.059252 | 1.069464 | 1.100562 |
| ADM | 1.065700 | 1.163593 | 1.056337 | 1.131331 | 0.000000 | 1.256732 | 0.993180 | 1.196457 | 1.068909 | 1.010650 | ... | 1.370138 | 1.107022 | 1.040956 | 1.089559 | 1.140173 | 1.019397 | 0.991642 | 1.056167 | 1.145328 | 1.176001 |
5 rows × 420 columns
The code below transforms the distance matrices into network representations for each year. These networks serve as the basis for further analysis, such as generating Minimum Spanning Trees (MSTs).
# Transform the distance matrices into networks for every year
distance_2010_graph = nx.Graph(distance_2010)
distance_2011_graph = nx.Graph(distance_2011)
distance_2012_graph = nx.Graph(distance_2012)
distance_2013_graph = nx.Graph(distance_2013)
distance_2014_graph = nx.Graph(distance_2014)
distance_2015_graph = nx.Graph(distance_2015)
distance_2016_graph = nx.Graph(distance_2016)
distance_2017_graph = nx.Graph(distance_2017)
distance_2018_graph = nx.Graph(distance_2018)
distance_2019_graph = nx.Graph(distance_2019)
distance_2020_graph = nx.Graph(distance_2020)
distance_2021_graph = nx.Graph(distance_2021)
distance_2022_graph = nx.Graph(distance_2022)
distance_2023_graph = nx.Graph(distance_2023)
graph_2010_filtered = nx.minimum_spanning_tree(distance_2010_graph)
graph_2011_filtered = nx.minimum_spanning_tree(distance_2011_graph)
graph_2012_filtered = nx.minimum_spanning_tree(distance_2012_graph)
graph_2013_filtered = nx.minimum_spanning_tree(distance_2013_graph)
graph_2014_filtered = nx.minimum_spanning_tree(distance_2014_graph)
graph_2015_filtered = nx.minimum_spanning_tree(distance_2015_graph)
graph_2016_filtered = nx.minimum_spanning_tree(distance_2016_graph)
graph_2017_filtered = nx.minimum_spanning_tree(distance_2017_graph)
graph_2018_filtered = nx.minimum_spanning_tree(distance_2018_graph)
graph_2019_filtered = nx.minimum_spanning_tree(distance_2019_graph)
graph_2020_filtered = nx.minimum_spanning_tree(distance_2020_graph)
graph_2021_filtered = nx.minimum_spanning_tree(distance_2021_graph)
graph_2022_filtered = nx.minimum_spanning_tree(distance_2022_graph)
graph_2023_filtered = nx.minimum_spanning_tree(distance_2023_graph)
figure, axes = plt.subplots(14, 1, figsize=(24, 120))
# The 2010 graph goes in the first subplot (axes[0]); without it that panel stays empty
nx.draw_networkx(graph_2010_filtered, with_labels=False, ax=axes[0],node_size=200)
axes[0].set_title('MST Based Network Graph for 2010', fontsize=16)
nx.draw_networkx(graph_2011_filtered, with_labels=False, ax=axes[1],node_size=200)
axes[1].set_title('MST Based Network Graph for 2011', fontsize=16)
nx.draw_networkx(graph_2012_filtered, with_labels=False, ax=axes[2],node_size=200)
axes[2].set_title('MST Based Network Graph for 2012', fontsize=16)
nx.draw_networkx(graph_2013_filtered, with_labels=False, ax=axes[3],node_size=200)
axes[3].set_title('MST Based Network Graph for 2013', fontsize=16)
nx.draw_networkx(graph_2014_filtered, with_labels=False, ax=axes[4],node_size=200)
axes[4].set_title('MST Based Network Graph for 2014', fontsize=16)
nx.draw_networkx(graph_2015_filtered, with_labels=False, ax=axes[5],node_size=200)
axes[5].set_title('MST Based Network Graph for 2015', fontsize=16)
nx.draw_networkx(graph_2016_filtered, with_labels=False, ax=axes[6],node_size=200)
axes[6].set_title('MST Based Network Graph for 2016', fontsize=16)
nx.draw_networkx(graph_2017_filtered, with_labels=False, ax=axes[7],node_size=200)
axes[7].set_title('MST Based Network Graph for 2017', fontsize=16)
nx.draw_networkx(graph_2018_filtered, with_labels=False, ax=axes[8],node_size=200)
axes[8].set_title('MST Based Network Graph for 2018', fontsize=16)
nx.draw_networkx(graph_2019_filtered, with_labels=False, ax=axes[9],node_size=200)
axes[9].set_title('MST Based Network Graph for 2019', fontsize=16)
nx.draw_networkx(graph_2020_filtered, with_labels=False, ax=axes[10],node_size=200)
axes[10].set_title('MST Based Network Graph for 2020', fontsize=16)
nx.draw_networkx(graph_2021_filtered, with_labels=False, ax=axes[11],node_size=200)
axes[11].set_title('MST Based Network Graph for 2021', fontsize=16)
nx.draw_networkx(graph_2022_filtered, with_labels=False, ax=axes[12],node_size=200)
axes[12].set_title('MST Based Network Graph for 2022', fontsize=16)
nx.draw_networkx(graph_2023_filtered, with_labels=False, ax=axes[13],node_size=200)
axes[13].set_title('MST Based Network Graph for 2023', fontsize=16)
plt.tight_layout()
plt.show()
The resulting MST graphs are much easier to read, and they contain no cycles any more.
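The acyclic claim can be verified on a toy graph: an MST over n nodes always has exactly n − 1 edges and passes NetworkX's tree check (the edge weights below are arbitrary, chosen only for illustration):

```python
import networkx as nx

# Toy weighted complete graph on 4 nodes
g = nx.complete_graph(4)
for u, v in g.edges:
    g[u][v]['weight'] = abs(u - v)  # arbitrary "distances"

mst = nx.minimum_spanning_tree(g)

# A spanning tree of n nodes always has n - 1 edges and no cycles
print(mst.number_of_edges())  # 3
print(nx.is_tree(mst))        # True
```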
The code below calculates the average shortest path length for each MST graph and stores the values in a list for further analysis.
#Create empty lists
average_degree_connectivity = []
average_shortest_path_length = []
year = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022,2023]
# Create a loop Through Each Year’s Graph:
for graph in [graph_2010_filtered, graph_2011_filtered, graph_2012_filtered, graph_2013_filtered, graph_2014_filtered, graph_2015_filtered,
graph_2016_filtered, graph_2017_filtered, graph_2018_filtered, graph_2019_filtered, graph_2020_filtered, graph_2021_filtered, graph_2022_filtered,graph_2023_filtered]:
# nx.average_shortest_path_length(graph) function from the NetworkX library is used to calculate the average shortest path length
average_shortest_path_length.append(nx.average_shortest_path_length(graph))
# Plot the average shortest path length per year
figure = plt.figure(figsize=(22, 8))
sns.lineplot(x='year', y='average_shortest_path_length',
data=pd.DataFrame({'year': year, 'average_shortest_path_length': average_shortest_path_length}));
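For intuition about what `nx.average_shortest_path_length` computes, here it is on a small hand-checkable graph: a 4-node path, where the six pairwise hop counts (1, 2, 3, 1, 2, 1) sum to 10:

```python
import networkx as nx

# Path graph 0-1-2-3: average over all unordered node pairs
path = nx.path_graph(4)
print(nx.average_shortest_path_length(path))  # 10 / 6 ≈ 1.667
```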
From the above plot we can see that the average shortest path length was fairly stable until 2015, increased significantly in 2016 and 2017, and decreased again in 2018. In 2020 there was another increase, which dropped abruptly in 2022 and has been rising again since. This may have some correlation with broader financial concerns.
In the following notebook section I will construct and visualize a network (graph) based on the relationships between S&P 500 stocks from 2010 to 2023.
# Calculate the daily log returns of the stocks from 2010 to 2023 (log P_t - log P_{t-1})
log_returns_2010_till_2023 = np.log(stock_data_cleaned) - np.log(stock_data_cleaned.shift(1))
# Calculate the correlation matrix of the log returns 2010 to 2023
return_correlation_2010_till_2023 = log_returns_2010_till_2023.corr()
# Plot the correlation matrix as heatmap
figure = plt.figure(figsize=(24, 8))
sns.heatmap(return_correlation_2010_till_2023);
distance_2010_till_2023 = np.sqrt(2 * (1 - return_correlation_2010_till_2023))
distance_2010_till_2023_graph = nx.Graph(distance_2010_till_2023)
distance_2010_till_2023_graph_filtered = nx.minimum_spanning_tree(distance_2010_till_2023_graph)
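The transform used above, d = sqrt(2 * (1 - rho)), maps a correlation rho in [-1, 1] onto a distance in [0, 2], so strongly correlated stocks end up close together in the graph. A quick illustration of the three boundary cases:

```python
import numpy as np

# rho = 1 (perfectly correlated)      -> distance 0
# rho = 0 (uncorrelated)              -> distance sqrt(2) ≈ 1.414
# rho = -1 (perfectly anti-correlated) -> distance 2
for rho in (1.0, 0.0, -1.0):
    print(rho, np.sqrt(2 * (1 - rho)))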
# Get the number of nodes in the graph
num_nodes = distance_2010_till_2023_graph_filtered.number_of_nodes()
print("Number of nodes:", num_nodes)
# Get the number of edges in the graph
num_edges = distance_2010_till_2023_graph_filtered.number_of_edges()
print("Number of edges:", num_edges)
# Check if the graph is connected
is_connected = nx.is_connected(distance_2010_till_2023_graph_filtered)
print("Is the graph connected?", is_connected)
Number of nodes: 420
Number of edges: 419
Is the graph connected? True
figure = plt.figure(figsize=(24, 12))
# Let's visualize the graph with nx.draw_kamada_kawai from NetworkX to make the layout more interpretable
nx.draw_kamada_kawai(distance_2010_till_2023_graph_filtered, with_labels=False,node_size=200)
# Display the plot
plt.title('MST Graph of S&P 500 Stock Market (2010-2023)', fontsize=20)
plt.show()
Note: The graph is connected and has 420 nodes (stocks/companies) and 419 edges, exactly the n - 1 edges expected of a spanning tree. In the plot we only perceive part of it, since the layout has been scaled to the plot size (a larger plot would reveal more nodes and edges).
Calculating centrality measures for the network helps us understand the importance and influence of individual nodes within the graph.
Degree centrality measures the number of direct connections a node has; nodes with higher degree centrality are often considered more influential because they are directly connected to many other nodes.
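For intuition, here is degree centrality on a hypothetical toy star graph (an illustration only, not part of the S&P data): degree centrality in NetworkX is a node's degree divided by n - 1.

```python
import networkx as nx

# Hypothetical 5-node star: node 0 is the hub, nodes 1-4 are leaves
star = nx.star_graph(4)
dc = nx.degree_centrality(star)
print(dc[0], dc[1])  # hub: 4/4 = 1.0, each leaf: 1/4 = 0.25
```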
degree_centrality = nx.degree_centrality(distance_2010_till_2023_graph_filtered)
# Create a bar chart showing the degree centrality of each stock
dc_data = pd.DataFrame(list(degree_centrality.items()), columns=['stocks', 'degree_centrality']).sort_values('degree_centrality', ascending=False)
px.bar(data_frame=dc_data, x='stocks', y='degree_centrality', template='plotly_dark')
from IPython.display import Image, display
# Display the image as a PNG for the HTML
display(Image(filename='newplot.png'))
Closeness centrality measures how close a node is to all other nodes in the graph; nodes with high closeness centrality can quickly reach every other node in the network, making them efficient at passing information.
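Again for intuition, a hypothetical toy example (not part of the S&P data): closeness centrality in NetworkX is (n - 1) divided by the sum of distances to all other nodes.

```python
import networkx as nx

# Hypothetical 3-node path 0-1-2: the middle node reaches everyone in one hop
path = nx.path_graph(3)
cc = nx.closeness_centrality(path)
print(cc[1], cc[0])  # middle node: 2/2 = 1.0, endpoints: 2/3 ≈ 0.667
```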
closeness_centrality = nx.closeness_centrality(distance_2010_till_2023_graph_filtered)
cc_data = pd.DataFrame(list(closeness_centrality.items()), columns=['stocks', 'closeness_centrality']).sort_values('closeness_centrality', ascending=False)
px.bar(data_frame=cc_data, x='stocks', y='closeness_centrality', template='plotly_dark')
from IPython.display import Image, display
# Display the image as a PNG for the HTML
display(Image(filename='newplot (1).png'))
Betweenness centrality measures how often a stock (node) lies on the shortest paths between other stocks in the market network. It indicates how much control a stock has over the flow of information in the market.
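A hypothetical toy example (for illustration only): on a 3-node path 0-1-2, the only shortest path between the endpoints runs through node 1, so node 1 gets the maximal normalized betweenness while the endpoints get zero.

```python
import networkx as nx

# Hypothetical 3-node path 0-1-2
path = nx.path_graph(3)
bc = nx.betweenness_centrality(path)
print(bc)  # node 1 → 1.0, endpoints → 0.0
```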
betweenness_centrality = nx.betweenness_centrality(distance_2010_till_2023_graph_filtered)
bc_data = pd.DataFrame(list(betweenness_centrality.items()), columns=['stocks', 'betweenness_centrality']).sort_values('betweenness_centrality', ascending=False)
px.bar(data_frame=bc_data, x='stocks', y='betweenness_centrality', template='plotly_dark')
from IPython.display import Image, display
# Display the image as a PNG for the HTML
display(Image(filename='newplot (2).png'))
In networks, as in finance, not all connections are equal: a connection to a highly influential node (stock) is more valuable than a connection to a less influential one. Eigenvector centrality captures this nuance by assigning higher scores to nodes (stocks) that are themselves connected to nodes with high centrality scores.
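This nuance shows up even on a hypothetical 5-node path (illustration only): nodes 1, 2 and 3 all have degree 2, yet the middle node scores highest because its neighbors are themselves more central.

```python
import networkx as nx

# Hypothetical 5-node path 0-1-2-3-4
path = nx.path_graph(5)
ec = nx.eigenvector_centrality_numpy(path)
print(ec[2] > ec[1] > ec[0])  # → True
```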
eigenvector_centrality=nx.eigenvector_centrality_numpy(distance_2010_till_2023_graph_filtered)
ec_data = pd.DataFrame(list(eigenvector_centrality.items()), columns=['stocks', 'eigenvector_centrality']).sort_values('eigenvector_centrality', ascending=False)
px.bar(data_frame=ec_data, x='stocks', y='eigenvector_centrality', template='plotly_dark')
from IPython.display import Image, display
# Display the image as a PNG for the HTML
display(Image(filename='/content/drive/newplot (3).png'))
So far the centrality analysis across various metrics (degree, closeness, betweenness, and eigenvector centrality) reveals that a small number of key stocks dominate the market network's structure and influence. These stocks are crucial in connecting different market segments, spreading information efficiently, and maintaining overall market stability.
The following process ranks the stocks by their importance in the network, measuring several distances between them and combining these with the previously calculated centrality measures into a single score per stock.
Calculate the shortest path from each stock (node) to the stock with the highest degree centrality.
# Distance on Degree Criterion
# Initialize dictionary to store shortest path lengths to the node with the highest degree centrality
distance_degree_criteria = {}
# Identify the node with the largest degree centrality (most connections)
node_with_largest_degree_centrality = max(degree_centrality, key=degree_centrality.get)
# Calculate and store the shortest path length from each node to the node with the largest degree centrality
for node in distance_2010_till_2023_graph_filtered.nodes():
distance_degree_criteria[node] = nx.shortest_path_length(distance_2010_till_2023_graph_filtered, node, node_with_largest_degree_centrality)
Calculate the shortest path from each stock to the stock whose summed correlation with its neighboring stocks is highest; this measures how closely every other stock is connected to this highly correlated hub.
# Distance on Correlation Criterion
# Initialize dictionaries to store shortest path lengths and sum of correlations
distance_correlation_criteria = {}
sum_correlation = {}
# Calculate the sum of correlations for each node with its neighbors
for node in distance_2010_till_2023_graph_filtered.nodes():
neighbors = nx.neighbors(distance_2010_till_2023_graph_filtered, node)
sum_correlation[node] = sum(return_correlation_2010_till_2023.loc[node, neighbor] for neighbor in neighbors)
# Identify the node with the highest sum of correlations with its neighbors
node_with_highest_correlation = max(sum_correlation, key=sum_correlation.get)
# Calculate and store the shortest path length from each node to the node with the highest correlation
for node in distance_2010_till_2023_graph_filtered.nodes():
distance_correlation_criteria[node] = nx.shortest_path_length(distance_2010_till_2023_graph_filtered, node, node_with_highest_correlation)
Calculate the shortest path from each stock to the stock with the minimum mean distance to all other nodes.
# Distance on Distance Criterion
# Initialize dictionaries to store shortest path lengths and mean distances
distance_distance_criteria = {}
mean_distance = {}
# Calculate the mean shortest path distance from each node to all other nodes
for node in distance_2010_till_2023_graph_filtered.nodes():
    other_nodes = [n for n in distance_2010_till_2023_graph_filtered.nodes() if n != node]
    distances = [nx.shortest_path_length(distance_2010_till_2023_graph_filtered, node, ns) for ns in other_nodes]
    mean_distance[node] = np.mean(distances)
# Identify the node with the minimum mean distance to all other nodes
node_with_minimum_mean_distance = min(mean_distance, key=mean_distance.get)
# Calculate and store the shortest path length from each node to the node with the minimum mean distance
for node in distance_2010_till_2023_graph_filtered.nodes():
distance_distance_criteria[node] = nx.shortest_path_length(distance_2010_till_2023_graph_filtered, node, node_with_minimum_mean_distance)
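The mean-distance computation above calls `nx.shortest_path_length` once per (node, target) pair, i.e. one BFS per target. On a tree of ~420 nodes this is fine, but `nx.single_source_shortest_path_length` returns the distances from one source to every node in a single traversal. A sketch of the equivalent computation on a hypothetical 4-node path standing in for the MST:

```python
import networkx as nx
import numpy as np

# Hypothetical 4-node path graph used only for illustration
g = nx.path_graph(4)
mean_distance = {}
for node in g.nodes():
    # One BFS gives the distance from `node` to every other node at once
    lengths = nx.single_source_shortest_path_length(g, node)
    mean_distance[node] = np.mean([d for target, d in lengths.items() if target != node])
print(mean_distance)  # the two middle nodes have the smallest mean distance
```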
# Store the results in a DataFrame
# Create a DataFrame from the degree centrality dictionary
node_stats = pd.DataFrame.from_dict(degree_centrality, orient='index')
node_stats.columns = ['degree_centrality']
# Add betweenness centrality to the DataFrame
node_stats['betweenness_centrality'] = pd.Series(betweenness_centrality)
# Add the distance criteria results to the DataFrame
node_stats['distance_degree_criteria'] = pd.Series(distance_degree_criteria)
node_stats['distance_correlation_criteria'] = pd.Series(distance_correlation_criteria)
node_stats['distance_distance_criteria'] = pd.Series(distance_distance_criteria)
# Calculate the average of the distance criteria and store it in the DataFrame
node_stats['average_distance'] = (
node_stats['distance_degree_criteria'] + node_stats['distance_correlation_criteria'] + node_stats['distance_distance_criteria']) / 3
This enables further analysis of the network, helping to identify not only the best-connected stocks (closeness) but also the most influential ones (eigenvector).
# Add closeness centrality and eigenvector centrality to the DataFrame
node_stats['closeness_centrality'] = pd.Series(closeness_centrality)
node_stats['eigenvector_centrality'] = pd.Series(eigenvector_centrality)
# Centrality Normalization
# Normalize degree centrality
node_stats['degree_centrality_norm'] = (
node_stats['degree_centrality'] - node_stats['degree_centrality'].min()) / (node_stats['degree_centrality'].max() - node_stats['degree_centrality'].min())
# Normalize closeness centrality
node_stats['closeness_centrality_norm'] = (
node_stats['closeness_centrality'] - node_stats['closeness_centrality'].min()) / (node_stats['closeness_centrality'].max() - node_stats['closeness_centrality'].min())
# Normalize betweenness centrality
node_stats['betweenness_centrality_norm'] = (
node_stats['betweenness_centrality'] - node_stats['betweenness_centrality'].min()) / (node_stats['betweenness_centrality'].max() - node_stats['betweenness_centrality'].min())
# Normalize eigenvector centrality
node_stats['eigenvector_centrality_norm'] = (
node_stats['eigenvector_centrality'] - node_stats['eigenvector_centrality'].min()) / (node_stats['eigenvector_centrality'].max() - node_stats['eigenvector_centrality'].min())
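The same min-max scaling, factored into a small helper (my own naming, not part of the notebook): every column maps linearly onto [0, 1], with the minimum at 0 and the maximum at 1.

```python
import pandas as pd

def min_max(series: pd.Series) -> pd.Series:
    # Linear rescaling of a Series onto [0, 1]
    return (series - series.min()) / (series.max() - series.min())

print(min_max(pd.Series([2.0, 4.0, 6.0])).tolist())  # → [0.0, 0.5, 1.0]
```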
# Combined Centrality Score
# Calculate the combined centrality score as the average of all normalized centralities
node_stats['combined_centrality'] = (node_stats['degree_centrality_norm'] +
node_stats['closeness_centrality_norm'] +
node_stats['betweenness_centrality_norm'] +
node_stats['eigenvector_centrality_norm']) / 4
The code below creates a final score for each stock by combining centrality and distance, giving 70% weight to the combined centrality and 30% to the inverse of the average distance. It then sorts the stocks from highest to lowest by this final score.
# Final Combined Score (Weighted)
final_weighted_score = 0.7 * node_stats['combined_centrality'] + 0.3 * (1 / node_stats['average_distance'])
node_stats['final_score'] = final_weighted_score
# Sort based on the final score
node_stats_sorted = node_stats.sort_values(by='final_score', ascending=False)
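A quick sanity check of the 70/30 weighting on toy (hypothetical) numbers: a stock with a high combined centrality of 0.8 sitting an average of two hops from the reference nodes.

```python
combined_centrality = 0.8
average_distance = 2.0
# 0.7 * 0.8 + 0.3 * (1 / 2.0) = 0.56 + 0.15 = 0.71
final_score = 0.7 * combined_centrality + 0.3 * (1 / average_distance)
print(round(final_score, 2))  # → 0.71
```

Note that the inversion `1 / average_distance` means nearer stocks score higher, which is the intended direction for a centrality ranking.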
Let's select the 15 most central stocks and the 15 most peripheral stocks based on their final scores.
# Select the top 15 central stocks
central_stocks = node_stats_sorted.head(15)
central_portfolio = list(central_stocks.index)
# Select the bottom 15 peripheral stocks
peripheral_stocks = node_stats_sorted.tail(15)
peripheral_portfolio = list(peripheral_stocks.index)
color = []
for node in distance_2010_till_2023_graph_filtered:
if node in central_portfolio:
color.append('red')
elif node in peripheral_portfolio:
color.append('green')
else:
color.append('blue')
# Plot the graph using the Kamada-Kawai layout
figure, ax = plt.subplots(figsize=(24, 12))
nx.draw_kamada_kawai(distance_2010_till_2023_graph_filtered, with_labels=False, node_color=color, ax=ax,node_size=100)
# Create a legend directly
ax.legend(handles=[plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='red', markersize=10, label='Central Portfolio'),
plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='green', markersize=10, label='Peripheral Portfolio'),
plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='blue', markersize=10, label='S&P 500')],
loc='upper left')
# Add a title to the plot
ax.set_title('Network Graph Visualization of Central and Peripheral Stocks (2010-2023)', fontsize=16)
# Display the plot
plt.show()
#Reading data for 2024 S&P 500 stocks:
stock_data_2024 = pd.read_csv('/content/drive/SP500_price_data_2024_till_August.csv', index_col=[0])
stock_data_2024
| MMM | AOS | ABT | ABBV | ABMD | ACN | ATVI | ADM | ADBE | ADP | ... | WYNN | XEL | XYL | YUM | ZBRA | ZBH | ZION | ZTS | CEG | OGN | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | |||||||||||||||||||||
| 2024-01-02 | 91.973244 | 81.440002 | 109.849998 | 159.820007 | NaN | 346.920013 | NaN | 72.760002 | 580.070007 | 233.419998 | ... | 94.599998 | 63.470001 | 113.669998 | 129.029999 | 267.980011 | 121.389999 | 44.139999 | 196.570007 | 115.250000 | 14.710000 |
| 2024-01-03 | 90.125420 | 78.709999 | 109.519997 | 160.460007 | NaN | 337.920013 | NaN | 73.129997 | 571.789978 | 232.509995 | ... | 94.029999 | 63.580002 | 111.559998 | 129.100006 | 252.520004 | 120.470001 | 42.000000 | 192.929993 | 115.849998 | 14.630000 |
| 2024-01-04 | 90.443146 | 79.379997 | 110.980003 | 161.460007 | NaN | 337.089996 | NaN | 71.750000 | 567.049988 | 233.660004 | ... | 94.209999 | 63.730000 | 112.330002 | 128.690002 | 252.970001 | 120.199997 | 42.639999 | 194.039993 | 115.360001 | 13.780000 |
| 2024-01-05 | 90.794312 | 79.040001 | 110.800003 | 162.139999 | NaN | 336.619995 | NaN | 70.809998 | 564.599976 | 235.210007 | ... | 95.650002 | 63.730000 | 112.150002 | 128.339996 | 252.690002 | 119.980003 | 44.049999 | 194.850006 | 116.239998 | 14.410000 |
| 2024-01-08 | 91.020065 | 79.269997 | 112.400002 | 161.429993 | NaN | 340.350006 | NaN | 71.059998 | 580.549988 | 236.759995 | ... | 97.050003 | 63.580002 | 112.849998 | 128.639999 | 261.089996 | 121.970001 | 44.540001 | 196.149994 | 117.029999 | 16.299999 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-08-16 | 127.050003 | 79.650002 | 111.000000 | 193.899994 | NaN | 326.920013 | NaN | 59.090000 | 553.460022 | 264.089996 | ... | 76.290001 | 59.980000 | 132.800003 | 137.399994 | 344.140015 | 110.650002 | 46.590000 | 183.710007 | 189.979996 | 20.350000 |
| 2024-08-19 | 127.080002 | 80.699997 | 111.900002 | 196.869995 | NaN | 329.440002 | NaN | 59.680000 | 563.119995 | 264.649994 | ... | 76.599998 | 60.150002 | 134.380005 | 136.690002 | 347.019989 | 111.769997 | 47.000000 | 184.479996 | 192.779999 | 20.959999 |
| 2024-08-20 | 127.699997 | 80.800003 | 110.769997 | 196.149994 | NaN | 330.369995 | NaN | 59.009998 | 562.250000 | 264.540009 | ... | 76.019997 | 59.990002 | 134.149994 | 136.949997 | 344.850006 | 111.639999 | 46.360001 | 183.600006 | 190.720001 | 20.799999 |
| 2024-08-21 | 129.229996 | 81.879997 | 111.389999 | 196.529999 | NaN | 333.600006 | NaN | 59.220001 | 565.789978 | 267.850006 | ... | 77.360001 | 59.959999 | 135.800003 | 137.750000 | 345.000000 | 112.129997 | 47.070000 | 182.899994 | 196.740005 | 21.030001 |
| 2024-08-22 | 128.600006 | 81.129997 | 112.099998 | 196.369995 | NaN | 330.570007 | NaN | 59.250000 | 557.440002 | 269.380005 | ... | 77.089996 | 59.919998 | 135.419998 | 136.759995 | 342.160004 | 113.419998 | 47.680000 | 182.169998 | 194.949997 | 20.889999 |
162 rows × 504 columns
The data for the S&P 500 index is fetched from the Federal Reserve Economic Data (FRED)
# Get the S&P500 historical data for the year 2024 from pandas_datareader
snp_500_2024 = web.DataReader(['sp500'], 'fred', start='2024-01-01', end='2024-08-22')
snp_500_2024
| sp500 | |
|---|---|
| DATE | |
| 2024-01-01 | NaN |
| 2024-01-02 | 4742.83 |
| 2024-01-03 | 4704.81 |
| 2024-01-04 | 4688.68 |
| 2024-01-05 | 4697.24 |
| ... | ... |
| 2024-08-16 | 5554.25 |
| 2024-08-19 | 5608.25 |
| 2024-08-20 | 5597.12 |
| 2024-08-21 | 5620.85 |
| 2024-08-22 | 5570.64 |
169 rows × 1 columns
Note: The NaN (Not a Number) on 2024-01-01 indicates that the stock market was closed that day (a holiday), so no trading took place and no closing value is available.
Next, we align and merge the two datasets by date.
# Drop columns with any NA values
stock_data_2024 = stock_data_2024.dropna(axis=1)
snp_500_2024 = snp_500_2024.dropna()
# Convert indices to datetime to ensure proper alignment by date
stock_data_2024.index = pd.to_datetime(stock_data_2024.index)
snp_500_2024.index = pd.to_datetime(snp_500_2024.index)
# Find common dates between the stock data and the S&P 500 data
common_dates = stock_data_2024.index.intersection(snp_500_2024.index)
# Filter both dataframes to include only the common dates
stock_data_2024 = stock_data_2024.loc[common_dates]
snp_500_2024 = snp_500_2024.loc[common_dates]
snp_500_2024.shape
(162, 1)
stock_data_2024.shape
(162, 475)
stock_data_2024
| MMM | AOS | ABT | ABBV | ACN | ADM | ADBE | ADP | AAP | AES | ... | WYNN | XEL | XYL | YUM | ZBRA | ZBH | ZION | ZTS | CEG | OGN | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2024-01-02 | 91.973244 | 81.440002 | 109.849998 | 159.820007 | 346.920013 | 72.760002 | 580.070007 | 233.419998 | 61.950001 | 19.340000 | ... | 94.599998 | 63.470001 | 113.669998 | 129.029999 | 267.980011 | 121.389999 | 44.139999 | 196.570007 | 115.250000 | 14.710000 |
| 2024-01-03 | 90.125420 | 78.709999 | 109.519997 | 160.460007 | 337.920013 | 73.129997 | 571.789978 | 232.509995 | 61.000000 | 18.930000 | ... | 94.029999 | 63.580002 | 111.559998 | 129.100006 | 252.520004 | 120.470001 | 42.000000 | 192.929993 | 115.849998 | 14.630000 |
| 2024-01-04 | 90.443146 | 79.379997 | 110.980003 | 161.460007 | 337.089996 | 71.750000 | 567.049988 | 233.660004 | 60.490002 | 18.870001 | ... | 94.209999 | 63.730000 | 112.330002 | 128.690002 | 252.970001 | 120.199997 | 42.639999 | 194.039993 | 115.360001 | 13.780000 |
| 2024-01-05 | 90.794312 | 79.040001 | 110.800003 | 162.139999 | 336.619995 | 70.809998 | 564.599976 | 235.210007 | 61.520000 | 18.799999 | ... | 95.650002 | 63.730000 | 112.150002 | 128.339996 | 252.690002 | 119.980003 | 44.049999 | 194.850006 | 116.239998 | 14.410000 |
| 2024-01-08 | 91.020065 | 79.269997 | 112.400002 | 161.429993 | 340.350006 | 71.059998 | 580.549988 | 236.759995 | 61.910000 | 19.070000 | ... | 97.050003 | 63.580002 | 112.849998 | 128.639999 | 261.089996 | 121.970001 | 44.540001 | 196.149994 | 117.029999 | 16.299999 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-08-16 | 127.050003 | 79.650002 | 111.000000 | 193.899994 | 326.920013 | 59.090000 | 553.460022 | 264.089996 | 62.189999 | 17.260000 | ... | 76.290001 | 59.980000 | 132.800003 | 137.399994 | 344.140015 | 110.650002 | 46.590000 | 183.710007 | 189.979996 | 20.350000 |
| 2024-08-19 | 127.080002 | 80.699997 | 111.900002 | 196.869995 | 329.440002 | 59.680000 | 563.119995 | 264.649994 | 60.900002 | 17.370001 | ... | 76.599998 | 60.150002 | 134.380005 | 136.690002 | 347.019989 | 111.769997 | 47.000000 | 184.479996 | 192.779999 | 20.959999 |
| 2024-08-20 | 127.699997 | 80.800003 | 110.769997 | 196.149994 | 330.369995 | 59.009998 | 562.250000 | 264.540009 | 60.810001 | 17.379999 | ... | 76.019997 | 59.990002 | 134.149994 | 136.949997 | 344.850006 | 111.639999 | 46.360001 | 183.600006 | 190.720001 | 20.799999 |
| 2024-08-21 | 129.229996 | 81.879997 | 111.389999 | 196.529999 | 333.600006 | 59.220001 | 565.789978 | 267.850006 | 61.919998 | 17.670000 | ... | 77.360001 | 59.959999 | 135.800003 | 137.750000 | 345.000000 | 112.129997 | 47.070000 | 182.899994 | 196.740005 | 21.030001 |
| 2024-08-22 | 128.600006 | 81.129997 | 112.099998 | 196.369995 | 330.570007 | 59.250000 | 557.440002 | 269.380005 | 51.099998 | 17.770000 | ... | 77.089996 | 59.919998 | 135.419998 | 136.759995 | 342.160004 | 113.419998 | 47.680000 | 182.169998 | 194.949997 | 20.889999 |
162 rows × 475 columns
To track the value of the central and peripheral portfolios over time, I use a representative initial investment of $10,000 and show how each portfolio's value changes day by day. (I chose $10,000 because it makes the final plot easy to read.)
# Initial investment amount
amount = 10000
# Create an empty DataFrame to store the prices of the stocks in the central portfolio
central_portfolio_value = pd.DataFrame()
for stock in central_portfolio:
    central_portfolio_value[stock] = stock_data_2024[stock]
portfolio_unit = central_portfolio_value.sum(axis=1).iloc[0]  # sum of the first row: the initial value of the portfolio
share = amount / portfolio_unit  # number of units bought: the initial investment divided by the initial portfolio value
central_portfolio_value = central_portfolio_value.sum(axis=1) * share  # value of the central portfolio over time
# The same process applies to the peripheral portfolio
peripheral_portfolio_value = pd.DataFrame()
for stock in peripheral_portfolio:
    peripheral_portfolio_value[stock] = stock_data_2024[stock]
portfolio_unit = peripheral_portfolio_value.sum(axis=1).iloc[0]
share = amount / portfolio_unit
peripheral_portfolio_value = peripheral_portfolio_value.sum(axis=1) * share
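A hypothetical two-stock example of the same buy-and-hold logic: the initial amount buys `share` units of a basket priced at the sum of the constituents' prices, i.e. an equal number of shares of every stock (so the portfolio is price-weighted, not equal-dollar-weighted).

```python
import pandas as pd

# Hypothetical tickers and prices for two trading days
prices = pd.DataFrame({'AAA': [100.0, 110.0], 'BBB': [100.0, 90.0]})
amount = 10000
portfolio_unit = prices.sum(axis=1).iloc[0]   # basket price on day 0: 200
share = amount / portfolio_unit               # 50 units of the basket
value = prices.sum(axis=1) * share
print(value.tolist())  # → [10000.0, 10000.0]: the two moves offset exactly
```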
To compare against the index, I scale the S&P 500 values to reflect how much the initial investment would grow (or shrink) over time. The calculation below determines the number of "units" of the S&P 500 index the initial $10,000 could buy.
# Scale the S&P 500 index values to reflect an initial investment amount
snp_500_2024_value = snp_500_2024 * (amount / snp_500_2024.iloc[0])
all_portfolios = snp_500_2024_value
all_portfolios['central_portfolio'] = central_portfolio_value.values
all_portfolios['peripheral_portfolio'] = peripheral_portfolio_value.values
all_portfolios.tail()
| sp500 | central_portfolio | peripheral_portfolio | |
|---|---|---|---|
| 2024-08-16 | 11710.835092 | 10960.916494 | 10045.654823 |
| 2024-08-19 | 11824.691165 | 11031.035246 | 10190.822537 |
| 2024-08-20 | 11801.224164 | 10988.220295 | 10220.127797 |
| 2024-08-21 | 11851.257582 | 11069.043105 | 10223.182570 |
| 2024-08-22 | 11745.392519 | 11048.649683 | 10161.574208 |
figure, ax = plt.subplots(figsize=(16, 8))
# Plot S&P 500 values
snp_500_line = ax.plot(all_portfolios['sp500'], label='S&P 500')
# Plot Central Portfolio values
central_portfolio_line = ax.plot(all_portfolios['central_portfolio'], label= 'Central Portfolio', color= 'red')
# Plot Peripheral Portfolio values
peripheral_portfolio_line = ax.plot(all_portfolios['peripheral_portfolio'], label= 'Peripheral Portfolio', color= 'green')
ax.legend(loc='upper left')
# Title to the plot
ax.set_title('Portfolio Performance Comparison')
# Adding a title to the Y-axis
ax.set_ylabel('Investment Value ')
plt.show()
This case study demonstrates the value of network analysis for understanding the structure of the stock market. Modeling the market as a network allows a more nuanced view of the relationships and influences between stocks, helping us better understand and analyze market dynamics.